On the Product Rule for Classification Problems 



Marcelo Cicconet 

New York University 
cicconet@gmail.com



Keywords: Supervised Learning; Classification; Product Rule. 

Abstract: We discuss theoretical aspects of the product rule for classification problems in supervised machine learning
for the case of combining classifiers. We show that (1) the product rule arises from the MAP classifier supposing
equal priors and conditional independence given a class; (2) under some conditions, the product rule
is equivalent to minimizing the sum of the squared distances to the respective centers of the classes associated with
different features, such distances being weighted by the spread of the classes; (3) under some hypotheses,
the product rule is equivalent to concatenating the vectors of features.



1 Introduction 

With the advance of the Machine Learning field,
and the discovery of many different techniques, the
subject of combining multiple learners [2] eventually
drew attention, in particular the problem of combining
classifiers. Many different methods appeared, and
soon they were compared in terms of efficiency in
solving problems.

The product rule has been present in some of these
works (e.g., [1], [7], [3], [6], [5], [4], [8]), in contexts ranging
from the accuracy of the different combination rules
to some analytical properties of the different methods.

It has been shown that, in the context of handwritten
digit recognition, the product rule performs
better for combining linear classifiers. In general,
however, the product rule does not stand out from
competitors [6]. For the problem of combining audio
and video signals in guitar-chord recognition, the
product rule is better than the sum rule [5], but on the
problem of identity verification using face and voice
profiles, the sum rule wins [7].

On the theoretical side, [1] shows that for problems
with two classes, the sum and product rules are
equivalent when using two classifiers and the sum of
the estimates of the a posteriori probabilities is equal
to one. In [7], the product rule is derived from the
hypothesis of conditional statistical independence between
different representations of the data. There are
also some intuitive explanations for the choice of the
product rule, for instance the fact that the product
("AND" operator) is preferred to the sum
rule ("OR" operator) because it enforces all qualities
defined by the measures at once [9].
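The two-class equivalence mentioned above can be checked numerically. The sketch below is only an illustration under stated assumptions: `a` and `b` are hypothetical names for the two classifiers' posterior estimates for the first class, so the second class receives `1 - a` and `1 - b`. Both rules then reduce to the test a + b > 1.

```python
import random


def sum_rule(a, b):
    """Class index (0 or 1) chosen by the sum rule.

    a, b: the two classifiers' posterior estimates for class 0;
    class 1 receives 1 - a and 1 - b (estimates sum to one).
    """
    return 0 if a + b > (1 - a) + (1 - b) else 1


def product_rule(a, b):
    """Class index (0 or 1) chosen by the product rule for the same inputs."""
    return 0 if a * b > (1 - a) * (1 - b) else 1


def count_disagreements(trials=10_000, seed=0, tol=1e-9):
    """Count random (a, b) pairs on which the two rules disagree."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(trials):
        a, b = rng.random(), rng.random()
        if abs(a + b - 1.0) < tol:
            continue  # skip numerical ties on the decision boundary
        if sum_rule(a, b) != product_rule(a, b):
            disagreements += 1
    return disagreements
```

Algebraically, a·b > (1 - a)(1 - b) expands to a + b > 1, which is also what the sum-rule comparison simplifies to.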



In this text, analytical properties of the product
rule are further analyzed, in the context of two or
more classifiers. We show that (1) the product rule
arises from the MAP classifier supposing equal
priors and conditional independence given a class;
(2) under some conditions, the product rule is equivalent
to minimizing the sum of the squared distances to
the respective centers of the classes associated with
different features, such distances being weighted by the
spread of the classes; (3) under some hypotheses,
the product rule is equivalent to concatenating the
vectors of features.

Our work extends the current theoretical understanding
of the product rule provided by Alexandre
et al. [1] and Kittler et al. [7], as was done for the
sum rule by Li and Zong [8].



2 Theoretical Facts 

Definition 1. Let $X, Y$ be (continuous) random variables
corresponding to 2 distinct feature vectors, and
$C$ the (discrete) random variable corresponding to
the class, whose output can be $c_1, \dots, c_K$. For any
$Z \in \{X, Y\}$ and $k \in \{1, \dots, K\}$, let $p_{Z,k}$ be a function
that outputs the confidence that the class is $c_k$ considering
that the features-variable is $Z$. Supposing that
the features are $X = x$ and $Y = y$, the product rule for
classification will assign $C = c_{\tilde{k}}$ provided

\[ p_{X,\tilde{k}}(x) \cdot p_{Y,\tilde{k}}(y) = \max_{k=1,\dots,K} p_{X,k}(x) \cdot p_{Y,k}(y) . \]

In this definition and in the following results we
are using, for simplicity, only two random variables,
named $X$ and $Y$. We could have used, instead, a set
of $N$ random variables, say $X_1, \dots, X_N$, but that would
unnecessarily overload the notation.
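For concreteness, the product rule with an arbitrary number of classifiers can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `product_rule` and the nested-list layout `confidences[n][k]` (confidence of classifier n for class k, with 0-based class indices) are assumptions made for the example.

```python
import math


def product_rule(confidences):
    """Return the 0-based class index chosen by the product rule.

    confidences[n][k] is the confidence that classifier n (one per
    feature variable) assigns to class k.
    """
    num_classes = len(confidences[0])
    # For each class, multiply the confidences given by all classifiers.
    scores = [
        math.prod(row[k] for row in confidences) for k in range(num_classes)
    ]
    # Pick the class with the largest product.
    return max(range(num_classes), key=scores.__getitem__)


# Three hypothetical classifiers scoring three classes.
example = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
]
chosen = product_rule(example)
```

With these made-up numbers the per-class products are 0.06, 0.084 and 0.001, so the second class (index 1) is chosen even though two of the three classifiers individually prefer the first.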

Definition 2. Let $(X, Y)$ be the random variable obtained
by concatenating the features $X$ and $Y$, and
$p(\cdot \mid C = c_k)$ the density function for the variable
$(X, Y)$ conditioned on $C = c_k$. We will denote the value
of this function at the point $(x, y)$ by
$p(X = x, Y = y \mid C = c_k)$. Let $P(C = c_k)$ be the prior
probability that the class is $c_k$.

Finally, let us define $p_{(X,Y),k}(x, y)$ as follows:

\[ p_{(X,Y),k}(x, y) = p(X = x, Y = y \mid C = c_k) \cdot P(C = c_k) . \]

Given a sampled value $(X, Y) = (x, y)$, the MAP
(Maximum a Posteriori) classifier will assign $C = c_{\tilde{k}}$
provided

\[ p_{(X,Y),\tilde{k}}(x, y) = \max_{k=1,\dots,K} p_{(X,Y),k}(x, y) . \]

Fact 1. When using the MAP classifier, the product
rule arises under the hypotheses of (1) conditional
independence given the class and (2) equal prior
probabilities for the classes.

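Proof. The MAP classifier selects the class $c_k$ that maximizes

\[ p(X = x, Y = y \mid C = c_k) \cdot P(C = c_k) . \]

Now hypothesis 1 means

\[ p(X = x, Y = y \mid C = c_k) = p(X = x \mid C = c_k) \cdot p(Y = y \mid C = c_k) , \]

and hypothesis 2 implies that $P(C = c_{\tilde{k}}) = P(C = c_k)$
for all $\tilde{k}, k = 1, \dots, K$. The prior is therefore a positive
constant factor, which does not change the maximizing class, and

\[ \arg\max_{k=1,\dots,K} p_{(X,Y),k}(x, y) = \arg\max_{k=1,\dots,K} p(X = x \mid C = c_k) \cdot p(Y = y \mid C = c_k) , \]

which is the product rule (see Definition 1) for
$p_{X,k}(x) = p(X = x \mid C = c_k)$ and $p_{Y,k}(y) = p(Y = y \mid C = c_k)$.

□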
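Fact 1 can be illustrated with a small numerical sketch. Everything concrete here is an assumption made for the example: one-dimensional Gaussian class-conditional densities, the parameter tables `X_PARAMS` and `Y_PARAMS`, and the helper names. With equal priors the MAP decision matches the product rule; with skewed priors it may not, which is why hypothesis 2 is needed.

```python
import math


def gauss_pdf(v, mean, std):
    """Density of a one-dimensional normal distribution at v."""
    z = (v - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


# Hypothetical (mean, std) of X | C = c_k and Y | C = c_k for K = 3 classes.
X_PARAMS = [(0.0, 1.0), (2.0, 0.5), (-1.0, 2.0)]
Y_PARAMS = [(1.0, 1.5), (0.0, 1.0), (3.0, 0.7)]
K = len(X_PARAMS)


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def confidences(x, y):
    """p_{X,k}(x) * p_{Y,k}(y) for each class k (hypothesis 1)."""
    return [
        gauss_pdf(x, *X_PARAMS[k]) * gauss_pdf(y, *Y_PARAMS[k])
        for k in range(K)
    ]


def product_rule_class(x, y):
    return argmax(confidences(x, y))


def map_class(x, y, priors):
    """MAP decision: class-conditional density times prior."""
    return argmax([c * p for c, p in zip(confidences(x, y), priors)])


equal_priors = [1.0 / K] * K      # hypothesis 2 holds
skewed_priors = [0.8, 0.1, 0.1]   # hypothesis 2 violated
```

At the point (x, y) = (1.2, 0.5), for instance, the product rule and the equal-prior MAP classifier both pick the second class, while the skewed priors move the MAP decision to the first class.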

Fact 2. For each $Z \in \{X, Y\}$, let $d_Z$ be the (finite)
dimension of the variable $Z$, $I_{d_Z}$ the identity matrix of
dimensions $d_Z \times d_Z$, and $\Sigma_{Z,k} = \sigma_{Z,k}^2 I_{d_Z}$ (where $\sigma_{Z,k}$ is a
positive number). Also, for each $k = 1, \dots, K$, let $\mu_{Z,k}$
be fixed points in $\mathbb{R}^{d_Z}$.

Defining confidence functions (see Definition 1)

\[ p_{X,k}(x) = e^{-\frac{1}{2} (x - \mu_{X,k})^T \Sigma_{X,k}^{-1} (x - \mu_{X,k})} , \text{ and} \qquad (1) \]

\[ p_{Y,k}(y) = e^{-\frac{1}{2} (y - \mu_{Y,k})^T \Sigma_{Y,k}^{-1} (y - \mu_{Y,k})} , \qquad (2) \]

the product rule is equivalent to

\[ \min_{k=1,\dots,K} \frac{1}{\sigma_{X,k}^2} \|x - \mu_{X,k}\|^2 + \frac{1}{\sigma_{Y,k}^2} \|y - \mu_{Y,k}\|^2 . \]

That is, supposing gaussian-like classifiers with isotropic
covariances (multiples of the identity), the product rule tries
to minimize the sum of the squared distances to the
respective "centers" of classes for $X$ and $Y$, such distances
being weighted by the inverse of the "spread"
of the classes (an intuitively reasonable strategy,
in fact).

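Proof. Under the mentioned hypotheses, $\Sigma_{Z,k}^{-1} = \sigma_{Z,k}^{-2} I_{d_Z}$, so we have

\[ \max_{k=1,\dots,K} p_{X,k}(x) \cdot p_{Y,k}(y) = \max_{k=1,\dots,K} e^{-\frac{1}{2\sigma_{X,k}^2} \|x - \mu_{X,k}\|^2 - \frac{1}{2\sigma_{Y,k}^2} \|y - \mu_{Y,k}\|^2} . \]

Applying log (which is increasing, so the maximizing $k$ is
unchanged) and multiplying by 2 the second member
of the above equality results in

\[ \max_{k=1,\dots,K} p_{X,k}(x) \cdot p_{Y,k}(y) \iff \min_{k=1,\dots,K} \frac{1}{\sigma_{X,k}^2} \|x - \mu_{X,k}\|^2 + \frac{1}{\sigma_{Y,k}^2} \|y - \mu_{Y,k}\|^2 . \]

□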
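The equivalence in Fact 2 can be checked numerically. The sketch below is an illustration under assumptions: the tuple layout `(mu_x, sigma_x, mu_y, sigma_y)` for each class and all function names are made up for the example, and vectors are plain Python lists.

```python
import math
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors given as lists."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def confidence(z, mu, sigma):
    """Equations (1)-(2) with covariance sigma^2 times the identity."""
    return math.exp(-0.5 * sq_dist(z, mu) / sigma ** 2)


def product_rule(x, y, params):
    """params[k] = (mu_x, sigma_x, mu_y, sigma_y) for class k."""
    scores = [
        confidence(x, mx, sx) * confidence(y, my, sy)
        for mx, sx, my, sy in params
    ]
    return max(range(len(params)), key=scores.__getitem__)


def weighted_distance_rule(x, y, params):
    """Minimize the spread-weighted sum of squared distances."""
    costs = [
        sq_dist(x, mx) / sx ** 2 + sq_dist(y, my) / sy ** 2
        for mx, sx, my, sy in params
    ]
    return min(range(len(params)), key=costs.__getitem__)


def agreement_rate(trials=500, seed=0, num_classes=4, dx=2, dy=3):
    """Fraction of random problems on which the two rules agree."""
    rng = random.Random(seed)

    def point(d):
        return [rng.uniform(-2, 2) for _ in range(d)]

    agree = 0
    for _ in range(trials):
        params = [
            (point(dx), rng.uniform(0.5, 2), point(dy), rng.uniform(0.5, 2))
            for _ in range(num_classes)
        ]
        x, y = point(dx), point(dy)
        agree += product_rule(x, y, params) == weighted_distance_rule(x, y, params)
    return agree / trials
```

The weighting matters: with a tight class centered at 0 (spread 1) and a wide class centered at 3 (spread 3) in both features, the point x = y = 1.4 is closer to the first center in plain Euclidean terms, yet both rules assign it to the second class.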



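Fact 3. Let us now define confidence functions as follows:

\[ p_{X,k}(x) = \frac{1}{(2\pi)^{d_X/2} |\Sigma_{X,k}|^{1/2}} e^{-\frac{1}{2} (x - \mu_{X,k})^T \Sigma_{X,k}^{-1} (x - \mu_{X,k})} , \text{ and} \]

\[ p_{Y,k}(y) = \frac{1}{(2\pi)^{d_Y/2} |\Sigma_{Y,k}|^{1/2}} e^{-\frac{1}{2} (y - \mu_{Y,k})^T \Sigma_{Y,k}^{-1} (y - \mu_{Y,k})} , \]

where, for each $Z \in \{X, Y\}$, $|\Sigma_{Z,k}|$ is the determinant
of $\Sigma_{Z,k}$. Let us suppose also that, conditioned on the
class $c_k$, $X$ and $Y$ are uncorrelated, that is, $\Sigma_k$ being
the covariance of $(X, Y) \mid C = c_k$, we can write

\[ \Sigma_k = \begin{pmatrix} \Sigma_{X,k} & 0 \\ 0 & \Sigma_{Y,k} \end{pmatrix} , \]

where, for each $Z \in \{X, Y\}$, $\Sigma_{Z,k}$ is the covariance of
$Z \mid C = c_k$. Then, putting $\mu_k = (\mu_{X,k}, \mu_{Y,k})$, we have

\[ p_{X,k}(x) \cdot p_{Y,k}(y) = \frac{1}{(2\pi)^{(d_X + d_Y)/2} |\Sigma_k|^{1/2}} e^{-\frac{1}{2} ((x,y) - \mu_k)^T \Sigma_k^{-1} ((x,y) - \mu_k)} . \]

That is, supposing gaussian classifiers, the product
rule is equivalent to learning using the concatenated
vectors of features.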

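Proof. The inverse of $\Sigma_k$ is

\[ \Sigma_k^{-1} = \begin{pmatrix} \Sigma_{X,k}^{-1} & 0 \\ 0 & \Sigma_{Y,k}^{-1} \end{pmatrix} . \]

This way, the expression

\[ (x - \mu_{X,k})^T \Sigma_{X,k}^{-1} (x - \mu_{X,k}) + (y - \mu_{Y,k})^T \Sigma_{Y,k}^{-1} (y - \mu_{Y,k}) \]

reduces to

\[ ((x,y) - \mu_k)^T \Sigma_k^{-1} ((x,y) - \mu_k) . \]

Now, since the determinant of a block-diagonal matrix is the
product of the determinants of its blocks,
$|\Sigma_k| = |\Sigma_{X,k}| \cdot |\Sigma_{Y,k}|$, and hence

\[ \frac{1}{(2\pi)^{d_X/2} |\Sigma_{X,k}|^{1/2}} \cdot \frac{1}{(2\pi)^{d_Y/2} |\Sigma_{Y,k}|^{1/2}} = \frac{1}{(2\pi)^{(d_X + d_Y)/2} |\Sigma_k|^{1/2}} . \]

Therefore

\[ p_{X,k}(x) \cdot p_{Y,k}(y) = \frac{1}{(2\pi)^{(d_X + d_Y)/2} |\Sigma_k|^{1/2}} e^{-\frac{1}{2} ((x,y) - \mu_k)^T \Sigma_k^{-1} ((x,y) - \mu_k)} . \]

□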
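The identity in Fact 3 can be verified numerically for a single class. The sketch below uses only the standard library and made-up parameters (`mu_x`, `cov_x`, `mu_y`, `cov_y` are hypothetical). The joint density is evaluated on the full block-diagonal covariance by generic Gaussian elimination, so the block structure is not exploited in the check.

```python
import math


def solve_and_det(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting.

    Returns (z, det(a)); a is a list of rows, b a list.
    """
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= m[col][col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z, det


def gaussian_density(v, mean, cov):
    """Multivariate normal density at v."""
    d = len(v)
    diff = [vi - mi for vi, mi in zip(v, mean)]
    z, det = solve_and_det(cov, diff)  # z = cov^{-1} diff
    quad = sum(di * zi for di, zi in zip(diff, z))
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** d * det)


def block_diag(a, b):
    """Block-diagonal matrix with blocks a and b (uncorrelated X and Y)."""
    na, nb = len(a), len(b)
    return [row + [0.0] * nb for row in a] + [[0.0] * na + row for row in b]


# Hypothetical parameters of one class c_k.
mu_x, cov_x = [1.0, -1.0], [[2.0, 0.6], [0.6, 1.0]]
mu_y, cov_y = [0.0, 2.0], [[1.5, -0.4], [-0.4, 0.8]]
x, y = [0.5, 0.0], [1.0, 1.0]

product = gaussian_density(x, mu_x, cov_x) * gaussian_density(y, mu_y, cov_y)
joint = gaussian_density(x + y, mu_x + mu_y, block_diag(cov_x, cov_y))
```

Here `x + y` and `mu_x + mu_y` are list concatenations, i.e. the concatenated feature vector and mean; `product` and `joint` should agree up to floating-point rounding.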



3 Discussion 

According to Fact 1, the product rule arises when
maximizing the posterior under the hypotheses of
equal priors and conditional independence given
a class. We have just seen (Fact 3) that, supposing
only uncorrelatedness (which is weaker than independence),
the product rule appears as well. But in fact we have
used gaussian classifiers, i.e., we supposed the data
was normally distributed. This is in accordance with
the fact that joint normality and uncorrelatedness imply
independence.

An important consequence of Fact 3 has to do with
the curse of dimensionality. If there is strong evidence
that the conditional joint distribution of $(X, Y)$ given
any class $C = c_k$ is well approximated by a normal
distribution, and that $X \mid C = c_k$ and $Y \mid C = c_k$ are
uncorrelated, then the product rule is an interesting option,
because we do not have to deal with a feature
vector with dimension larger than the largest of the
dimensions of the original descriptors. Besides, the product
rule allows parallelization.
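One way to make the dimensionality argument concrete is to count the free parameters of the covariance matrices that would have to be estimated per class. This is only an illustrative sketch: it assumes full (unconstrained) symmetric covariances are fitted, and the dimensions `d_x = 10` and `d_y = 20` are made up.

```python
def covariance_parameters(d):
    """Number of free parameters in a symmetric d x d covariance matrix."""
    return d * (d + 1) // 2


# Hypothetical descriptor dimensions.
d_x, d_y = 10, 20

# Product rule: one covariance per descriptor, each fitted independently
# (the two fits could run in parallel).
separate = covariance_parameters(d_x) + covariance_parameters(d_y)

# Concatenated features: one covariance of dimension d_x + d_y.
joint = covariance_parameters(d_x + d_y)
```

With these dimensions the two separate models need 55 + 210 = 265 covariance parameters per class, against 465 for the concatenated model.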